Final Project
This course is available on Udacity as ud170.
Data Analysis Process
Setting Up Your System
Otherwise, you can find the free course here.
Intro to CSVs
If you’d like to learn more about data wrangling, check out the Udacity course Data Wrangling with MongoDB.
CSVs in Python
https://s3.cn-north-1.amazonaws.com.cn/u-vid-hd/22sQCo6ovH0.mp4
Python’s csv Module
This page contains documentation for Python’s csv module. Instead of csv, you’ll be using unicodecsv in this course. unicodecsv works exactly the same as csv, but it comes with Anaconda and has support for unicode. The csv documentation page is still the best way to learn how to use the unicodecsv library, since the two libraries work exactly the same way.
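As a rough sketch of how the module turns a file into rows, assuming Python 3 (where the standard csv module already handles unicode) and a small made-up sample standing in for the course's CSV files:

```python
import csv
import io

# Made-up sample standing in for one of the course CSV files.
SAMPLE_CSV = "account_key,status,days_to_cancel\n448,canceled,65\n700,current,\n"


def read_csv(file_obj):
    """Return every row of an open CSV file as a dictionary keyed by column name."""
    reader = csv.DictReader(file_obj)
    return list(reader)


rows = read_csv(io.StringIO(SAMPLE_CSV))
print(rows[0]["account_key"])  # every value is read as a string
```

Because every field comes back as a string (including empty strings for missing values), the data types still need to be fixed afterwards.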
Iterators in Python
This page explains the difference between iterators and lists in Python, and how to use iterators.
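The difference can be sketched with a plain list and an iterator built from it (the values are made up):

```python
numbers = [1, 2, 3]

# A list can be looped over as many times as you like.
first_pass = [n for n in numbers]
second_pass = [n for n in numbers]

# An iterator hands out each element once and is then exhausted.
iterator = iter(numbers)
first_item = next(iterator)        # consumes the first element
remaining = list(iterator)         # consumes everything that is left
after_exhaustion = list(iterator)  # nothing left to read
```

A csv reader behaves like the iterator here, which is why wrapping it in list() is useful when the rows are needed more than once.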
Solutions
DAND students click here for solution code
IPND students: Look at the end of this lesson for Quiz Solutions
Fixing Data Types
https://s3.cn-north-1.amazonaws.com.cn/u-vid-hd/7NSYtdVrlRE.mp4
Questions about Student Data
https://s3.cn-north-1.amazonaws.com.cn/u-vid-hd/AO8vSyAtfV4.mp4
Investigating the Data
Now you’ve started the data wrangling process by loading the data and making sure it’s in a good format. The next step is to investigate a bit and see if there are any inconsistencies or problems in the data that you’ll need to clean up.
For each of the three files you’ve loaded, find the total number of rows in the csv and the number of unique students. To find the number of unique students in each table, you might want to try creating a set of the account keys.
Again, in case you’re not finished with your local setup, you can complete this exercise in the Udacity code editor. You’ll need to run the next exercise locally, though, so if you haven’t finished setting up, you should do that now.
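One possible way to do the counting, sketched on a made-up table (the rows and the helper name count_unique_students are illustrative assumptions, not the course's solution):

```python
# Made-up rows; the real tables are loaded from the CSV files.
enrollments = [
    {"account_key": "448", "status": "canceled"},
    {"account_key": "448", "status": "current"},
    {"account_key": "700", "status": "current"},
]


def count_unique_students(rows, key="account_key"):
    """Count distinct students by collecting their account keys in a set."""
    return len({row[key] for row in rows})


total_rows = len(enrollments)
unique_students = count_unique_students(enrollments)
```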
Problems in the Data
Removing an Element from a Dictionary
If you’re not sure how to remove an element from a dictionary, this post might be helpful.
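A short sketch of two standard ways to remove a dictionary entry, using a made-up record:

```python
record = {"acct": "448", "lessons_completed": 2}

# pop() removes the key and returns its value, so it can be used to rename a key.
record["account_key"] = record.pop("acct")

# del removes a key without returning anything.
del record["lessons_completed"]
```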
Solutions
DAND students click here for solution code
IPND students: Look at the end of this lesson for Quiz Solutions
Updated Code for Previous Exercise
After running the above code, Caroline also shows rewriting the solution from the previous exercise to the following code:
Missing Engagement Records
Printing a Single Row
This page describes how to use Python’s break statement, which might be helpful for printing only a single problem record.
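A minimal sketch of the idea, assuming made-up enrollment rows and a made-up set of account keys that have engagement records:

```python
enrollments = [
    {"account_key": "448"},
    {"account_key": "1219"},
    {"account_key": "871"},
]
engaged_keys = {"448"}  # keys that appear in the engagement table

first_problem = None
for enrollment in enrollments:
    if enrollment["account_key"] not in engaged_keys:
        first_problem = enrollment
        print(first_problem)
        break  # stop after the first problem record
```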
Solutions
DAND students click here for solution code
IPND students: Look at the end of this lesson for Quiz Solutions
Checking for More Problem Records
Tracking Down the Remaining Problems
Refining the Question
Exploratory Data Analysis
If you’d like to learn more about the exploratory phase of the data analysis process, check out the Udacity course Data Analysis with R.
Solutions
DAND students click here for solution code
IPND students: Look at the end of this lesson for Quiz Solutions
Getting Data from First Week
https://s3.cn-north-1.amazonaws.com.cn/u-vid-hd/adqc5fF5B8Y.mp4
https://classroom.udacity.com/courses/ud170/lessons/5430778793/concepts/53961386350923
Note that paid students may have canceled from other courses before paying, and the suggested solution will retain records from these other enrollments.
Indulge Curiosity
Exploring Student Engagement
Debugging Data Analysis Code
Lessons Completed in First Week
Number of Visits in the First Week
https://s3.cn-north-1.amazonaws.com.cn/u-vid-hd/5GYA5j1fqBU.mp4
https://classroom.udacity.com/courses/ud170/lessons/5430778793/concepts/53961386450923
Splitting out Passing Students
Quiz: Comparing the Two Student Groups
Quiz: Making Histograms
Visualizing data
Even though you know the mean, standard deviation, maximum, and minimum of various metrics, there are a lot of other facts about each metric that would be nice to know. Are more values close to the minimum or the maximum? What is the median? And so on.
Instead of printing out more statistics, at this point it makes sense to visualize the data using a histogram.
Making histograms in Python
To make a histogram in Python, you can use the matplotlib library, which comes with Anaconda. The following code will make a histogram of an example list of data points called data.
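A minimal sketch of such a histogram, assuming matplotlib is installed and using made-up values for data:

```python
import matplotlib.pyplot as plt

# In an IPython/Jupyter notebook, `%matplotlib inline` would go in a cell first.
data = [1, 2, 1, 3, 3, 1, 4, 2]  # made-up data points

# plt.hist draws the histogram and also returns the bin counts and bin edges.
counts, bin_edges, _ = plt.hist(data)

# Outside a notebook, uncomment the next line to open the plot window.
# plt.show()
```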
The line %matplotlib inline is specifically for IPython notebook, and causes your plots to appear in your notebook rather than a new window. If you are not using IPython notebook, you should not include this line, and instead you should add the line plt.show() at the bottom to show the plot in a new window.
Making histograms of student data
Now use this method to make a histogram of each of the three metrics we looked at for both students who pass the subway project and students who don’t. That is, you should create 6 histograms. Do any of the metrics have histograms with very different shapes for students who pass the subway project vs. those who don’t?
You can also create histograms of the metrics you explored on your own if you’d like.
Are your Results Just Noise?
Statistics
If you’d like to learn more about statistics, which you can use to rigorously determine how likely it is that your results are due to chance, check out the Udacity courses Intro to Descriptive Statistics and Intro to Inferential Statistics.
Correlation Does Not Imply Causation
Cheese and Bedsheet Tangling
To see the plot shown in the video, as well as many other amusing or strange correlations, check out this website.
A/B Testing
To learn more about using online experiments to determine whether one change causes another, take the Udacity course A/B Testing.
Predicting Based on Many Features
Machine Learning
To learn more about using machine learning to automatically make predictions, take the Udacity course Intro to Machine Learning.
Communication
Quiz: Improving Plots and Sharing Findings
Adding labels and titles
In matplotlib, you can add axis labels using plt.xlabel("Label for x axis") and plt.ylabel("Label for y axis"). For histograms, you usually only need an x-axis label, but for other plot types a y-axis label may also be needed. You can also add a title using plt.title("Title of plot").
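Put together, the labelling calls might be used like this (the data and label texts are made up for illustration):

```python
import matplotlib.pyplot as plt

minutes = [10, 35, 0, 120, 45, 60, 15, 90]  # made-up values

plt.figure()
plt.hist(minutes)
plt.xlabel("Minutes spent in first week")
plt.ylabel("Number of students")
plt.title("Distribution of first-week minutes")

axes = plt.gca()  # the axes the labels were attached to
```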
Making plots look nicer with seaborn
You can automatically make matplotlib plots look nicer using the seaborn library. This library is not automatically included with Anaconda, but Anaconda includes a package manager to make it easier to add new libraries. The package manager is called conda, and to use it, you should open the Command Prompt (on a PC) or terminal (on Mac or Linux), and type the command conda install seaborn.
If you are using a different Python installation than Anaconda, you may have a different package manager. The most common ones are pip and easy_install, and you can use them with the commands pip install seaborn or easy_install seaborn respectively.
Once you have installed seaborn, you can import it anywhere in your code using the line import seaborn as sns. Then any plot you make afterwards will automatically look better. (Since seaborn 0.8, importing the library no longer changes matplotlib's defaults by itself; call sns.set() after the import to apply the seaborn style.) Give it a try!
If you’re wondering why the abbreviation for seaborn is sns, it’s because seaborn was named after the character Samuel Norman Seaborn from the show The West Wing, and sns are his initials.
The seaborn package also includes some extra functions you can use to make complex plots that would be difficult in matplotlib. We won’t be covering those in this course, but if you’d like to see what functions seaborn has available, you can look through the documentation.
Adding extra arguments to your plot
You’ll also frequently want to add some arguments to your plot to tune how it looks. You can see what arguments are available on the documentation page for the hist function. One common argument to pass is the bins argument, which sets the number of bins used by your histogram. For example, plt.hist(data, bins=20) would make sure your histogram has 20 bins.
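A quick way to check the effect of bins is to look at what plt.hist returns; this sketch uses made-up data:

```python
import matplotlib.pyplot as plt

data = list(range(100))  # made-up data points

plt.figure()
# With bins=20 there are 20 counts and 21 bin edges.
counts, bin_edges, _ = plt.hist(data, bins=20)
```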
Improving one of your plots
Use these techniques to improve at least one of the plots you made earlier.
Sharing your findings
Finally, decide which of the discoveries you made this lesson you would most want to communicate to someone else, and write a forum post sharing your findings.
Data Analysis and Related Terms
Conclusion
Quiz Solutions
CSVs in Python
Investigating the Data
Problems in the Data
Missing engagement records
Checking for more problem records
num_problem_students
Refining the Question
Note that if you switch the order of the conditions in the second if statement like so
if (enrollment_date > paid_students[account_key] or
        account_key not in paid_students):
you will most likely get an error (a KeyError). Why do you think that is? Check out this Stackoverflow discussion to find out more: http://stackoverflow.com/questions/13960657/does-python-evaluate-ifs-conditions-lazily
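The behaviour comes from Python evaluating or from left to right and stopping as soon as the result is known. A small sketch with made-up values:

```python
paid_students = {"448": "2015-01-10"}
account_key = "700"            # not a paid student
enrollment_date = "2015-02-01"

# Membership test first: `or` stops as soon as it sees True,
# so the dictionary lookup on the right never runs.
safe = (account_key not in paid_students or
        enrollment_date > paid_students[account_key])

# Lookup first: the missing key is accessed before the membership test.
try:
    unsafe = (enrollment_date > paid_students[account_key] or
              account_key not in paid_students)
    error_raised = False
except KeyError:
    error_raised = True
```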
Getting Data from First Week
Debugging Data Analysis Code
Here is the code Caroline shows in the solution video:
Alternatively, you can find the account key with the maximum minutes using this shorthand notation:
Fixing Bug in within_one_week()
She also updated the code for the within_one_week function to the following:
Lessons Completed in First Week
First, Caroline refactors the given code to analyze total minutes spent in the first week into the following:
Then she called the functions she created to analyze the lessons completed in the first week as follows:
Number of Visits in the First Week
Here is the code Caroline shows in the solution video. First she ran this code to create the has_visited field:
Then, after recreating the engagement_by_account dictionary with the updated data, she ran the following code to analyze days visited in the first week:
Splitting out Passing Students
Here is the code Caroline shows in the solution video:
Comparing the Two Student Groups
Here is the code Caroline shows in the solution video:
Making Histograms
Here is the code Caroline shows in the solution video:
Fixing the Number of Bins
To change how many bins are shown for each plot, try using the bins argument to the hist function. You can find documentation for the hist function and the arguments it takes here.
Improving Plots and Sharing Findings
Here is the code Caroline shows in the solution video:
Quiz: Survey Says!
Numpy and Pandas for 1D Data
Introduction
Quiz: Gapminder Data
Gapminder data
The data in this lesson was obtained from the site gapminder.org. The variables included are:
Aged 15+ Employment Rate (%)
Life Expectancy (years)
GDP/capita (US$, inflation adjusted)
Primary school completion (% of boys)
Primary school completion (% of girls)
You can also obtain the data to analyze on your own from the Downloadables section.
One-Dimensional Data in NumPy and Pandas
Quiz: NumPy Arrays
Pandas | Numpy |
---|---|
Series | Array |
Similarities and differences between a NumPy array and a Python list
Similarity | Difference |
---|---|
Both can be iterated over with a for loop | All elements of a NumPy array have the same type |
solution
argmax() returns the position of the maximum value, i.e. the position of max().
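These points can be sketched with a small made-up array (the values are not from the Gapminder data):

```python
import numpy as np

rates = np.array([55.7, 61.5, 75.7, 58.4])  # made-up values

# Like a list, an array can be iterated over with a for loop.
total = 0.0
for rate in rates:
    total += rate

# Unlike a list, all elements share one type: the integer is stored as a float.
mixed = np.array([1, 2.5])

# argmax() gives the position of the largest value.
position = rates.argmax()
```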
Quiz: Vectorized Operations
+ operation:
Python | NumPy |
---|---|
list concatenation | vector addition |
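A short illustration of the table, with multiplication by a scalar added for comparison:

```python
import numpy as np

list_sum = [1, 2] + [3, 4]                        # concatenation
array_sum = np.array([1, 2]) + np.array([3, 4])   # element-wise addition

list_times = [1, 2] * 3                           # repetition
array_times = np.array([1, 2]) * 3                # element-wise multiplication
```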
Quiz: Multiplying by a Scalar
Quiz: Calculate Overall Completion Rate
Bitwise Operations
See this article for more information about bitwise operations.
In NumPy, a & b performs a bitwise and of a and b. This is not necessarily the same as a logical and, if you wanted to see if matching terms in two integer vectors were non-zero. However, if a and b are both arrays of booleans, rather than integers, bitwise and and logical and are the same thing. If you want to perform a logical and on integer vectors, then you can use the NumPy function np.logical_and(a, b) or convert them into boolean vectors first.
Similarly, a | b performs a bitwise or, and ~a performs a bitwise not. However, if your arrays contain booleans, these will be the same as performing logical or and logical not. NumPy also has similar functions for performing these logical operations on integer-valued arrays.
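The difference between the bitwise and logical versions shows up on integer arrays; the values below are made up:

```python
import numpy as np

a = np.array([1, 2, 0])
b = np.array([2, 2, 5])

bitwise = a & b                    # 1 & 2 is 0 even though both are non-zero
logical = np.logical_and(a, b)     # True where both entries are non-zero
via_bool = (a != 0) & (b != 0)     # convert to booleans first, then use &
```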
For the quiz, assume that the number of males and females are equal i.e. we can take a simple average to get an overall completion rate.
In the solution, we may want to divide with / 2. instead of just / 2. This is because in Python 2, dividing an integer by another integer (2) drops fractions, so if our inputs are also integers, we may end up losing information. If we divide by a float (2.) then we will definitely retain decimal values.
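A sketch of the averaging step with made-up completion rates. In Python 3 the / operator always keeps the fraction, and // reproduces the Python 2 integer behaviour described above:

```python
import numpy as np

female = np.array([97, 96, 100])  # made-up integer rates
male = np.array([96, 95, 99])

overall = (female + male) / 2.    # fractions are kept
floored = (female + male) // 2    # fractions are dropped
```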
Erratum: The output of cell [3] in the solution video is incorrect: it appears that the male variable has not been set to the proper value set in cell [2]. All values except for the first will be different. The correct output in cell Out[3]: should instead start with:
array([ 192.83205, 205.28855, 202.82258, 186.63257, 206.91115,
solution
Quiz: Standardizing Data
quiz
solution
Quiz: NumPy Index Arrays
quiz
solution
Quiz: + vs. +=
notice
array([2,3,4,5])
array([1,2,3,4])
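The two outputs above presumably correspond to the two cases in this sketch (the variable names are illustrative): += changes the existing array, while + creates a new one.

```python
import numpy as np

a = np.array([1, 2, 3, 4])
b = a              # b refers to the same array as a
a += 1             # modifies that array in place, so b sees the change
in_place_result = b.copy()

a = np.array([1, 2, 3, 4])
b = a
a = a + 1          # builds a new array; b still refers to the old one
rebinding_result = b.copy()
```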
Quiz: In-Place vs. Not In-Place
notice
array([100,2,3,4])
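A sketch of why the original array changes: a NumPy slice is a view onto the same data, whereas a list slice is a copy. The values are chosen to match the output above.

```python
import numpy as np

a = np.array([1, 2, 3, 4])
view = a[:3]       # a view: shares memory with a
view[0] = 100      # also changes a

items = [1, 2, 3, 4]
copied = items[:3]  # a copy: independent of items
copied[0] = 100     # items is unchanged
```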
Quiz: Pandas Series
quiz
solution
Quiz: Series Indexes
s.describe()
s.loc[INDEX]
s.iloc[0]
Pandas idxmax()
Note: The argmax() function mentioned in the videos has been re-aliased to idxmax(), which returns the index of the first maximally-valued element. In current pandas releases, Series.argmax() returns the integer position of the maximum instead. You can find documentation for the idxmax() function in Pandas here.
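The accessors listed above can be tried on a small made-up Series (the labels and values are illustrative):

```python
import pandas as pd

s = pd.Series([10, 30, 20], index=["a", "b", "c"])

by_label = s.loc["b"]       # look up by index label
by_position = s.iloc[0]     # look up by integer position
label_of_max = s.idxmax()   # index label of the largest value
summary = s.describe()      # count, mean, std, min, quartiles, max
```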
quiz
solution
Quiz: Vectorized Operations and Series Indexes
quiz
Quiz: Filling Missing Values
Remember that Jupyter notebooks will just print out the results of the last expression run in a code cell as though a print expression was run. If you want to save the results of your operations for later, remember to assign the results to a variable or, for some Pandas functions like .dropna(), use inplace = True to modify the starting object without needing to reassign it.
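A sketch of where the missing values come from and two ways of handling them, using made-up Series with partly different indexes:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])

# Labels present in only one Series produce NaN in the sum.
total = s1 + s2

# Option 1: treat missing entries as 0 while adding.
filled = s1.add(s2, fill_value=0)

# Option 2: drop the NaN entries, modifying `total` itself.
total.dropna(inplace=True)
```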
quiz
solution
Quiz: Pandas Series apply()
Note: The grader will execute your finished reverse_names(names) function on some test names Series when you submit your answer. Make sure that this function returns another Series with the transformed names.
split()
You can find documentation for Python’s split() function here.
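One possible shape for such a function, assuming the names are stored as "Last, First" (the sample names and the helper reverse_name are assumptions made for illustration):

```python
import pandas as pd


def reverse_name(name):
    """Turn 'Last, First' into 'First Last'."""
    last, first = name.split(", ")
    return first + " " + last


def reverse_names(names):
    # apply() calls reverse_name on every element and returns a new Series.
    return names.apply(reverse_name)


names = pd.Series(["Agassi, Andre", "Bonds, Barry"])
reversed_names = reverse_names(names)
```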
quiz
solution
Quiz: Plotting in Pandas
If the variable data is a NumPy array or a Pandas Series, just like if it is a list, the code plt.hist(data) will create a histogram of the data.
Pandas also has built-in plotting that uses matplotlib behind the scenes, so if data is a Series, you can create a histogram using data.hist().
There’s no difference between these two in this case, but sometimes the Pandas wrapper can be more convenient. For example, you can make a line plot of a series using data.plot(). The index of the Series will be used for the x-axis and the values for the y-axis.
In the following quiz, we’ve created Series containing the various variables we’ve been looking at this lesson. Pick a country you’re interested in, and make a plot of each variable over time.
The Udacity editor will only show one plot each time you click “Test Run”, so you can look at multiple plots by clicking “Test Run” multiple times. If you’re running plotting code locally, you may need to add the line plt.show() depending on your setup.
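A minimal sketch of the two Pandas plotting calls, assuming matplotlib is installed and using a made-up Series indexed by year:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up employment rates indexed by year.
employment = pd.Series([55.7, 56.1, 57.0, 56.4], index=[2004, 2005, 2006, 2007])

plt.figure()
line_axes = employment.plot()   # index on the x-axis, values on the y-axis

plt.figure()
hist_axes = employment.hist()   # histogram of the values

# When running locally, plt.show() may be needed to display the figures.
```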
quiz
solution
Conclusion
Numpy and Pandas for 2D Data
Introduction
Quiz: Subway Data
Quiz: Two-Dimensional NumPy Arrays
python: list of lists
numpy: 2D array
pandas: DataFrame
This page describes the memory layout of 2D NumPy arrays.
quiz
solution
Quiz: NumPy Axis
axis=0: operate along the columns (one result per column)
axis=1: operate along the rows (one result per row)
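A small made-up array makes the axis argument concrete:

```python
import numpy as np

ridership = np.array([
    [1, 2, 3],
    [4, 5, 6],
])

column_means = ridership.mean(axis=0)  # one value per column
row_means = ridership.mean(axis=1)     # one value per row
```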
quiz
solution
NumPy and Pandas Data types
Quiz: Accessing Elements of a DataFrame
quiz
solution
Loading Data into a DataFrame
Quiz: Calculating Correlation
Understanding and Interpreting Correlations
This page contains some scatterplots of variables with different values of correlation.
This page lets you use a slider to change the correlation and see how the data might look.
Pearson’s r only measures linear correlation! This image shows some different linear and non-linear relationships and what Pearson’s r will be for those relationships.
Corrected vs. Uncorrected Standard Deviation
By default, Pandas’ std() function computes the standard deviation using Bessel’s correction. Calling std(ddof=0) ensures that Bessel’s correction will not be used.
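The effect of ddof can be checked on a small made-up Series:

```python
import numpy as np
import pandas as pd

scores = pd.Series([1, 2, 3, 4])

sample_std = scores.std()              # pandas default: ddof=1 (Bessel's correction)
population_std = scores.std(ddof=0)    # no correction
numpy_std = np.std(scores.values)      # NumPy default: ddof=0
```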
Previous Exercise
The exercise where you used a simple heuristic to estimate correlation was the “Pandas Series” exercise in the previous lesson, “NumPy and Pandas for 1D Data”.
Pearson’s r in NumPy
NumPy’s corrcoef() function can be used to calculate Pearson’s r, also known as the correlation coefficient.
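As a sketch, Pearson's r can be computed by hand as the mean of the products of the standardized values (using the uncorrected standard deviation) and compared with np.corrcoef. The helper name pearson_r and the data are made up:

```python
import numpy as np


def pearson_r(x, y):
    """Mean of the products of the standardized values (ddof=0)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    std_x = (x - x.mean()) / x.std()
    std_y = (y - y.mean()) / y.std()
    return (std_x * std_y).mean()


x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]

manual = pearson_r(x, y)
# corrcoef returns a 2x2 matrix; the off-diagonal entry is r.
builtin = np.corrcoef(x, y)[0, 1]
```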
quiz
solution
Pandas Axis Names
Quiz: DataFrame Vectorized Operations
Pandas shift()
Documentation for the Pandas shift() function is here. If you’re still not sure how the function works, try it out and see!
Alternative Solution
As an alternative to using vectorized operations, you could also use the code return entries_and_exits.diff() to calculate the answer in a single step.
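Both approaches can be compared on a small made-up table of cumulative counts (the function name and column names follow the subway data but should be treated as assumptions here):

```python
import pandas as pd

entries_and_exits = pd.DataFrame({
    "ENTRIESn": [3144312, 3144335, 3144353],
    "EXITSn": [1088151, 1088159, 1088177],
})


def get_hourly_entries_and_exits(df):
    # shift(1) moves every row down by one, so the subtraction
    # gives the change since the previous measurement.
    return df - df.shift(1)


hourly = get_hourly_entries_and_exits(entries_and_exits)
same_as_diff = hourly.equals(entries_and_exits.diff())
```

The first row has no previous measurement, so it is NaN in both versions.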
quiz
solution
Quiz: DataFrame applymap()
Note: The grader will execute your finished convert_grades(grades) function on some test grades DataFrames when you submit your answer. Make sure that this function returns a DataFrame with the converted grades. Hint: You may need to define a helper function to use with .applymap().
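A sketch of the helper-function pattern. The grade boundaries and sample scores are invented for illustration, and because newer pandas releases name the element-wise method DataFrame.map rather than applymap, the sketch picks whichever exists:

```python
import pandas as pd


def convert_grade(grade):
    """Map one numerical grade to a letter (boundaries are made up)."""
    if grade >= 90:
        return "A"
    if grade >= 80:
        return "B"
    if grade >= 70:
        return "C"
    return "F"


def convert_grades(grades):
    # pandas >= 2.1 provides DataFrame.map; older releases only have applymap.
    elementwise = grades.map if hasattr(pd.DataFrame, "map") else grades.applymap
    return elementwise(convert_grade)


grades = pd.DataFrame({"exam1": [95, 72], "exam2": [81, 40]}, index=["Andre", "Barry"])
letters = convert_grades(grades)
```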
quiz
solution
Quiz: DataFrame apply()
Note: In order to get the proper computations, we should actually be setting the value of the “ddof” parameter to 0 in the .std() function.
Note that the type of standard deviation calculated by default is different between numpy’s .std() and pandas’ .std() functions. By default, numpy calculates a population standard deviation, with “ddof = 0”. On the other hand, pandas calculates a sample standard deviation, with “ddof = 1”. If we know all of the scores, then we have a population - so to standardize using pandas, we need to set “ddof = 0”.
.apply() is used to transform each column (the default) into a column, or each row (with the axis argument) into a row.
.applymap() is used to transform each element.
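A minimal sketch of column-wise standardization with apply(), using ddof=0 as the note suggests (the data and function names are made up):

```python
import pandas as pd


def standardize_column(column):
    # ddof=0 treats the scores as the whole population.
    return (column - column.mean()) / column.std(ddof=0)


def standardize(df):
    # apply() passes each column to standardize_column as a Series.
    return df.apply(standardize_column)


grades = pd.DataFrame({"exam1": [40, 60, 80], "exam2": [50, 50, 80]})
standardized = standardize(grades)
```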
quiz
solution
Quiz: DataFrame apply() Use Case 2
.apply() can also reduce each column to a single element: df.apply(np.max) is equivalent to df.max().
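The equivalence can be checked on a small made-up DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 5, 3], "b": [9, 2, 4]})

# The function receives a whole column and returns one value per column.
column_maxima = df.apply(np.max)
built_in_maxima = df.max()
```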
quiz
solution
Quiz: Adding a DataFrame to a Series
code
Quiz: Standardizing Each Column Again
quiz
solution
Quiz: Pandas groupby()
code
Quiz: Calculating Hourly Entries and Exits
In the quiz where you calculated hourly entries and exits, you did so for a single set of cumulative entries. However, in the original data, there was a separate set of numbers for each station.
Thus, to correctly calculate the hourly entries and exits, it was necessary to group by station and day, then calculate the hourly entries and exits within each day.
Write a function to do that. You should use the apply() function to call the function you wrote previously. You should also make sure you restrict your grouped data to just the entries and exits columns, since your function may cause an error if it is called on non-numerical data types.
If you would like to learn more about using groupby() in Pandas, this page contains more details.
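One way the pieces could fit together, sketched on made-up data. The column names UNIT, DATEn, ENTRIESn and EXITSn and the function names are assumptions, and group_keys=False is used so the result keeps the original row labels:

```python
import pandas as pd


def get_hourly_entries_and_exits(entries_and_exits):
    # Difference between each measurement and the previous one.
    return entries_and_exits - entries_and_exits.shift(1)


def hourly_by_station_and_day(ridership_df):
    grouped = ridership_df.groupby(["UNIT", "DATEn"], group_keys=False)
    # Restrict to the numeric columns before applying the function.
    return grouped[["ENTRIESn", "EXITSn"]].apply(get_hourly_entries_and_exits)


ridership_df = pd.DataFrame({
    "UNIT": ["R001", "R001", "R002", "R002"],
    "DATEn": ["05-01-11", "05-01-11", "05-01-11", "05-01-11"],
    "ENTRIESn": [100, 130, 500, 505],
    "EXITSn": [10, 25, 70, 90],
})
hourly = hourly_by_station_and_day(ridership_df)
```

The first row of each group is NaN because there is no earlier measurement within that group to subtract.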
Note: You will not be able to reproduce the ENTRIESn_hourly and EXITSn_hourly columns in the full dataset using this method. When creating the dataset, we did extra processing to remove erroneous values.
quiz
To clarify the structure of the data, the original data recorded the cumulative number of entries on each station at four-hour intervals. For the quiz, you just need to look at the differences between consecutive measurements on each station: by computing “hourly entries”, we just mean recording the number of new tallies between each recording period as a contrast to “cumulative entries”.
solution
Quiz: Combining Pandas DataFrames
In the merged table on the right, the join dates in the third and fourth rows should be 5/19 and 5/11, reflecting the account key mapping in the enrollments table.
quiz
solution
Quiz: Plotting for DataFrames
Just like Pandas Series, DataFrames also have a plot() method. If df is a DataFrame, then df.plot() will produce a line plot with a different colored line for each variable in the DataFrame. This can be a convenient way to get a quick look at your data, especially for small DataFrames, but for more complicated plots you will usually want to use matplotlib directly.
In the following quiz, create a plot of your choice showing something interesting about the New York subway data. For example, you might create:
Histograms of subway ridership on both days with rain and days without rain
A scatterplot of subway stations with latitude and longitude as the x and y axes and ridership as the bubble size
If you choose this option, you may wish to use the as_index=False argument to groupby(). There is example code in the following quiz.
A scatterplot with subway ridership on one axis and precipitation or temperature on the other
If you’re not sure how to make the plot you want, try searching on Google or take a look at the matplotlib documentation. Once you’ve created a plot you’re happy with, share what you’ve found on the forums!
quiz
solution
Three-Dimensional Data
Three-Dimensional Data
Now that you’ve worked with one-dimensional and two-dimensional data, you might be wondering how to work with three or more dimensions.
3D data in NumPy
NumPy arrays can have arbitrarily many dimensions. Just like you can create a 1D array from a list, and a 2D array from a list of lists, you can create a 3D array from a list of lists of lists, and so on. For example, the following code would create a 3D array:
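A minimal sketch of such an array, with made-up values:

```python
import numpy as np

# A list of lists of lists becomes a 3D array.
a = np.array([
    [[1, 2], [3, 4]],
    [[5, 6], [7, 8]],
])

shape = a.shape        # (2, 2, 2)
element = a[1, 0, 1]   # second block, first row, second column
```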
3D data in Pandas
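Pandas has a data structure called a Panel, which is similar to a DataFrame or a Series, but for 3D data. If you would like, you can learn more about Panels here. Note that Panel was deprecated in pandas 0.20 and removed in pandas 0.25; in newer versions, a DataFrame with a MultiIndex or the xarray package is the suggested replacement.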